Genomic Data Exploration

Author

Marnin Wolfe

Previously: Spatial, Multi-trait and Multi-stage Models

Lesson plan

  1. Download example SNP datasets from Canvas and get to know them
  2. Read a SNP dataset into R in a form suitable for genomic analyses.
  3. Diagnose common issues: missingness, minor allele frequencies, heterozygosity.
  4. Visualize population structure using PCA.

Intro

File Formats

As we will discuss in lecture, genomic data are stored in several different file formats.

It is valuable to familiarize yourself with them.

  • FASTA: Raw sequences, e.g. sequencing reads (which get aligned to a reference genome) or the reference genome sequence itself

    • A sequence begins with a greater-than character (“>”) followed by a description of the sequence (all in a single line). The lines immediately following the description line are the sequence representation, with one letter per amino acid or nucleic acid, and are typically no more than 80 characters in length. ~Wikipedia

    • Example fasta file. Nucleotide sequences from a single Illumina-type short-read. Usually ~150bp long.
  • VCF: Variant Call Format files. After aligning raw reads to a reference genome and completing quality control steps, variant discovery can be conducted. Variant discovery is the process of determining which genomic loci contain naturally varying sequences. VCF files are plain text files, but because they get quite large, you’ll often find them in compressed form (usually ending in .vcf.gz).

    • We are concerned primarily (in this class) with the most common type of variant: the single nucleotide polymorphism (SNP). SNPs are usually (not always) bi-allelic in a given population, meaning only two different versions (alleles) are found, e.g. A vs. T or G vs. C.

    • Official specification here: https://samtools.github.io/hts-specs/VCFv4.2.pdf

    • An example depicting the major components of a VCF file.
  • Hapmap: often, in crop science, our genotyping data are processed into a simpler-than-vcf form. The HMP file is a common one.

    • An example of the HMP (hapmap) format, from the GAPIT manual.
    • The first 11 columns display attributes of the SNPs and the remaining columns show the nucleotides observed at each SNP for each taxa. The first row contains the header labels and each remaining row contains all the information for a single SNP. ~GAPIT manual

  • “plink” format: plink is a really useful program for manipulating and conducting some population genomics analyses. The original plink format involves two files: a .ped and a .map. In the last 10 years, newer versions (plink1.9, now plink2) have supplanted that with a binary version (.bed/.bim/.fam).

    • Overview of various commonly used PLINK files. SNP = single nucleotide polymorphism
    • Thanks in part to the binary data format, plink1.9 is lightning fast, even on laptops and for large genomic datasets. Each of the software formats and file types described above has its uses.

Command Line (No Excel)

To work with these data files, you will not be able to use Excel. R can sometimes work, but not usually for the initial steps or the biggest datasets.

Working with these data requires a command line interface (CLI), usually a Linux or Unix shell.

NOTE: I am working on getting our class access to the Auburn High Performance Computing (HPC) cluster. In a future class, we can learn to work with these data types; doing so will require a unified computing environment.

Installing and compiling these programs on everyone’s laptops would be a challenge because of different hardware and software configurations.

Working at the Command Line: While Mac and Linux are built on a common sub-system (called Unix), Windows is not.

  • Windows: To access a Unix-like command line environment, you can install the “Windows Subsystem for Linux” (WSL): Link to Install Page.
  • If you’re working on a Mac, or are into Linux (🐧), you can simply open the “Terminal” application.

Learning Command Line:

Common Command Line Programs for Genomic Variant Datasets:

There are many options, and the ecosystem is evolving quickly; what to use depends on what you want or need to do.

However, here are a few of the core software I recommend:


Isik, Holland, and Maltecca (2017) Chapter 9: Exploratory Marker Analysis

SNP data in R

Our main hands-on goal today will be to see how SNP data can be represented in R. We will learn a few initial tricks about manipulating and visualizing those data.

Data on Canvas

Download the three datasets (described below) from the “Data” module on Canvas. Place them in the data/ sub-directory of your R project / repository.

I do not use all of them in my demo. They are for you to explore!

  • White Lupins (Lupinus albus):
    • WhiteLupins_KHU164_SNPcalls.Ihapmap: An imputed HapMap (~340MB) on a collection of >400 genotypes at ~250K SNPs. These data were generated by low-pass Illumina short-read sequencing of each sample by HudsonAlpha’s Khufu team. As part of the genotyping service, missing data are imputed by a proprietary Khufu algorithm.
    • wl_metadata.csv: This file provides some meta information that will be useful for exploring the data. Genotypes are labelled according to:
      • Population-of-origin (one developed at AU and another representing global diversity from the NPGS [National Plant Germplasm System]).
      • Plant Type (PlantArch): Lupins vary in their plant architecture. Some have DETERMINATE flowering from only the apical branch, others are INDETERMINATE.
  • Oats (Avena sativa):
    • SunOat2022-3K_01-05.AB.txt: was generated by USDA-ARS using a 3K SNP chip; file resembles a HapMap file and contains genotypes for 480 oat accessions. Some are named, others are experimental lines. These data are attributed to Drs. Ali Babar (UF) and Steve Harrison (LSU) as part of the SunGrains Breeding Cooperative.
  • Cassava (Manihot esculenta):
    • cassava_workshop_data.zip: You’ll want to download and unzip this.
    • Multiple file types are provided. These data are all publicly available and downloaded from www.cassavabase.org. The entire process is documented in a 2022 Workshop on Genomic Selection.
    • The data were generated using a combination of genotyping-by-sequencing and a similar reduced-representation marker platform called “DArTseqLD”. They have been pre-imputed.

Install ASRgenomics

Since we’ve been using asreml, I thought it useful to include another R package by the same team (VSNi): ASRgenomics.

To install, go to this VSNi page and click the “Download” button. Follow the provided instructions; you’ll need to fill out the web form to get the download. Install via RStudio.

library(tidyverse)
# Genomics helpers (focus today)
library(ASRgenomics)

Read oat data

First off, we’ll read the smallest dataset (the Oat data) into R.

# Be sure to match the file path to your LOCAL machine repository 
oats<-read_delim(here::here("data/SunOat_3KSNP_data","SunOat2022-3K_01-05.AB.txt"))
oats %>% dim
[1] 2989  483
oats[1:5,1:10]
# A tibble: 5 × 10
  `#ID`    CHROM    POS BIG_MAC BOB   BROOKS Caballo Cantara COKER_227 COKER_234
  <chr>    <chr>  <dbl> <chr>   <chr> <chr>  <chr>   <chr>   <chr>     <chr>    
1 avgbs_1… chr4A 4.58e8 NULL    AA    BB     AA      BB      NC        NULL     
2 avgbs_1… chr1C 4.57e8 BB      BB    BB     BB      BB      BB        BB       
3 avgbs_1… chr4A 4.13e8 AA      AA    AA     AA      AA      AA        AA       
4 avgbs_1… chr6C 1.41e8 BB      AA    AA     AA      AA      AA        AA       
5 avgbs_1… chr3D 4.68e8 BB      BB    AA     BB      AA      BB        BB       

This dataset has markers on the rows and genotypes of accessions on the columns.

Convert to “dosage”

ASRgenomics expects a genotype matrix M with individuals in rows, markers in columns, coded 0/1/2, with rownames = individual IDs and colnames = marker IDs. See qc.filtering(), snp.pca(), G.matrix() docs.

Example format:

id snp1 snp2 snp3
G001 0 1 NA
G002 2 0 1

This format is known as “allele dosage”. The numbers are a digitization of the genotype values (HOM REF, HET, HOM ALT) that is most useful for statistical genetics. “Allele dosage” assumes only two alleles are present and counts the number of “doses” an individual possesses of one of the two alleles; typically the “alternative” allele is counted. “Alternative” refers to the allele that IS NOT represented in the reference genome sequence. Sometimes the major or minor allele is counted instead.

Point is, something like 0 = AA, 1 = AT, 2 = TT is the encoding we want.
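As a toy illustration (the values are invented, not from the oat file), a named lookup vector in base R does this recoding in one step; anything not in the lookup (no-calls like “NC” or “NULL”) comes back as NA:

```r
# Toy recode sketch: two-letter genotype calls -> allele dosage (count of "B")
calls  <- c("AA", "AB", "BB", "NC", "NULL")
lookup <- c(AA = 0, AB = 1, BB = 2)   # unmatched names -> NA
dosage <- unname(lookup[calls])
dosage
```

This returns [1]  0  1  2 NA NA.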

ASRgenomics provides a helper function snp.recode() that can convert a HapMap (ATGC) format to dosage.

This works on the White Lupins dataset.

However, it doesn’t work on the oat dataset I chose as my example. 🤦

Below, I work out a pipeline that manually recodes the data. I recommend you reverse engineer it and know that different data sources pose different manipulation problems.

NOTE: If you’ve got 1000’s of individuals and millions of SNP, R might not be the right place to start.

oats_recoded<-oats %>% 
  # Sort by chrom AND pos
  arrange(CHROM,POS) %>%
  # Need an SNP_ID that can be put in the "rownames" and keeps track of chrom+pos+alleles
  mutate(SNP_ID=paste0(CHROM,"_",POS,":",`#ID`)) %>% 
  # Noticed some markers (rows) were not associated with a chromosome
  ## or had different chrom naming than the rest
  ## to be careful, I'll remove them
  filter(!CHROM %in% c("UN","ChrUn","1A","2C","4D","5C","6A")) %>% 
  # Move the SNP_ID column into the rownames position
  column_to_rownames(var = "SNP_ID") %>% 
  # Remove meta data columns
  select(-`#ID`,-CHROM,-POS)

# Use mutate() on all columns
# Find AA turn it into a 0
# AB = 1
# BB = 2
# Should get NA otherwise
oats_dosage<-oats_recoded %>% 
  mutate(across(.cols=everything(), 
                ~ ifelse(.x=="AA","0",
                         ifelse(.x=="AB","1",
                                ifelse(.x=="BB","2",.x))))) %>% 
  # as.numeric() coerces anything left over ("NC", "NULL") to NA, with a warning
  mutate(across(.cols=everything(),~as.numeric(.x)))

oats_dosage<-t(oats_dosage) %>% 
  as.matrix()

oats_dosage[1:5,1:5]
        chr1A_82978:GMI_ES02_c19502_234 chr1A_4617566:GMI_ES17_c4089_545
BIG_MAC                              NA                                0
BOB                                  NA                                2
BROOKS                               NA                                0
Caballo                              NA                                2
Cantara                              NA                                0
        chr1A_6688611:avgbs_115256 chr1A_10592626:avgbs2_19724
BIG_MAC                          1                           2
BOB                              2                           2
BROOKS                           1                           2
Caballo                          2                           2
Cantara                          1                           2
        chr1A_20166187:GMI_ES17_c2656_146
BIG_MAC                                 1
BOB                                     2
BROOKS                                  1
Caballo                                 2
Cantara                                 1
table(oats_dosage,useNA = 'ifany')
oats_dosage
     0      1      2   <NA> 
590754  81033 700431  56742 

Summary stats

Here are some handy calculations you can now do without any package:

# Proportion missing overall
mean(is.na(oats_dosage))
[1] 0.0397086
# The allele frequency (of the counted alleles)
af<-colMeans(oats_dosage, na.rm = T)/2
summary(af)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 0.0000  0.2185  0.5743  0.5390  0.8725  1.0000      51 

Usually, in statistical genetics applications, we want to know the minor allele frequency (MAF).

Why? Statistical reasons. Most datasets have lots of rare alleles. If genotypes at a locus are your predictors in a downstream analysis, you will not have good power to estimate the phenotype-genotype effect for very rare alleles. Imagine you do an experiment to test for a fertilizer effect with 100 plots. A typical design will have 50 plots WITH vs. 50 plots WITHOUT fertilizer. Why? Sampling error and statistical power. If instead you observed 2 plots WITH fertilizer and 98 WITHOUT, would you trust the estimate? Recall the sensitivity of fixed effects to imbalanced data. So too with marker genotypes. Stay tuned for more coverage of this concept / problem.

Rare alleles are informative, but unstable for some analyses; hence filtering is needed. This is why QC choices matter downstream.

# Here is a simple function to compute the minor allele frequencies
getMAF <- function(M){
  freq <- colMeans(M, na.rm = TRUE)/2
  maf <- freq
  maf[which(maf > 0.5)] <- 1 - maf[which(maf > 0.5)]
  return(maf)
}
maf<-getMAF(oats_dosage)
summary(maf)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
0.00000 0.05021 0.16614 0.19162 0.31746 0.50000      51 
hist(maf)
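Filtering on MAF by hand is just a matter of subsetting columns. Here is a minimal sketch on a made-up dosage matrix (matrix and threshold invented for illustration; in practice you’d apply this to oats_dosage):

```r
# Toy dosage matrix: 4 individuals (rows) x 3 markers (columns)
M <- matrix(c(0, 0, 0, 0,    # monomorphic marker: MAF = 0
              0, 1, 2, 2,    # MAF = 0.375
              2, 2, 2, 1),   # MAF = 0.125
            nrow = 4)
freq <- colMeans(M, na.rm = TRUE) / 2
maf  <- pmin(freq, 1 - freq)            # fold frequencies at 0.5
M_filtered <- M[, which(maf >= 0.05), drop = FALSE]
dim(M_filtered)   # the monomorphic marker is dropped
```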

QC filtering

ASRgenomics provides qc.filtering() to do basic QC filtering and simple imputation.

“Imputation is the process of replacing missing data with substituted values.” ~Wikipedia

Imputation is often needed for downstream applications that are only available for ‘complete data’.

We’ll use “mean-imputation” here. This means simply putting the sample mean in place of any missing data value.
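A minimal sketch of what mean imputation does, per marker (toy vector; real implementations work column-wise across the whole matrix):

```r
# One marker scored on 4 individuals, with one missing value
geno <- c(0, 2, NA, 2)
geno[is.na(geno)] <- mean(geno, na.rm = TRUE)  # NA -> marker mean (4/3)
geno
```

Note the imputed value (1.33) is no longer a valid 0/1/2 dosage; that’s fine for most matrix-algebra applications, but worth remembering.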

In genomics, there are MUCH better ways to impute. Here’s a link to the program Beagle (just an example, not an endorsement; Browning, Zhou, and Browning (2018)). Beagle uses what is called a “hidden Markov model” to probabilistically infer missing values, an extremely powerful computational approach for genomics given genome structure.

# qc.filtering expects 0/1/2 with NAs allowed.
# Key knobs: call rate, MAF, and imputation approach
# (see ?qc.filtering for exactly how each threshold is interpreted).
qc <- qc.filtering(M = oats_dosage,
                   maf             = 0.01, # keep markers with MAF >= 0.01
                   marker.callrate = 0.95, # per-marker missingness threshold
                   ind.callrate    = 0.80, # per-individual missingness threshold
                   impute          = TRUE) # simple mean imputation
# What did we get back?
names(qc)
[1] "M.clean"          "map"              "plot.missing.ind" "plot.missing.SNP"
[5] "plot.heteroz"     "plot.Fis"         "plot.maf"        
oats_dosage_cleaned<-qc$M.clean
dim(oats_dosage_cleaned)
[1]  480 2521
mean(is.na(oats_dosage_cleaned))
[1] 0

Several pre-made plots are provided. Pretty convenient.

Oats are self-pollinating small grains. Oat breeders typically (not always) wait until new lines have been inbred to the level of S5:6 (>99% homozygosity) before genotyping.

For that reason, a useful sanity check in this case is the rate of heterozygosity.

# According to the manual, 
# this is a plot of observed heterozygosity per SNP (original marker matrix).
qc$plot.heteroz

In fact, heterozygosity by individual may be much more relevant in this case.

Here’s a function to compute the proportion of homozygous genotypes per individual:

getPropHom<-function (M) {
    # M: 0/1/2 dosage matrix, individuals in rows; assumes no missing data
    W <- M
    W[which(W == 2)] <- 0  # zero out hom-alt so only hets (1s) remain
    f <- 1 - (rowSums(W)/ncol(W))  # 1 - het rate = proportion homozygous
    return(f) 
}
prop_hom<-getPropHom(oats_dosage_cleaned)
summary(prop_hom)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.2483  0.9101  0.9149  0.9092  0.9179  0.9288 
hist(prop_hom)

So, this passes the sanity check. But there ARE some pretty heterozygous individuals.

prop_hom %>% .[which(.<0.8)]
       NC14-1787 LA17089SBSS-73-1      TX20CAS1079        NC19-3542 
       0.7936335        0.7751210        0.2483380        0.7395914 

What do you think could explain this?


BTW, I’ve encapsulated several of these useful functions into my own R package, genomicMateSelectR.

If you’re interested, find it here: https://wolfemd.github.io/genomicMateSelectR/

You can install genomicMateSelectR package from my GitHub with:

devtools::install_github("wolfemd/genomicMateSelectR", ref = 'master')

PCA to visualize structure

ASRgenomics offers snp.pca() for PCA on the marker matrix. But you can also use the core R function prcomp().

Principal Components Analysis (PCA): linear dimensionality reduction technique often used in exploratory data analysis, visualization and data preprocessing.

Principal components define directions through the data that explain a maximal amount of variance: the lines that capture the most information about the dataset. The number of PCs equals the number of predictors, but the PCs are sorted so that the first explains the most variance and each subsequent PC explains progressively less. (The accompanying figure depicts the line that maximizes the variance, i.e. the average squared distance of the projected points from the origin.)
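For intuition, here is a minimal prcomp() sketch on a random made-up dosage matrix (everything here is invented for illustration; centering is essential):

```r
# PCA with base R: 6 individuals x 10 markers of random 0/1/2 dosages
set.seed(1)
M <- matrix(sample(0:2, 60, replace = TRUE), nrow = 6)
pca <- prcomp(M, center = TRUE, scale. = FALSE)
# Percent variance explained per PC; sorted largest-first by construction
pve <- 100 * pca$sdev^2 / sum(pca$sdev^2)
round(pve, 1)
```

pca$x holds the scores, analogous to the pca.scores element returned by snp.pca() below.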

Learn PCA:

pca_on_snps <- snp.pca(M = oats_dosage_cleaned, ncp = 15)  
# keep the first 15 PCs for exploration
names(pca_on_snps)
[1] "pca.scores"  "eigenvalues" "plot.scree"  "plot.pca"   
# PCA Scores
pca_on_snps$pca.scores %>%head
                 PC1        PC2        PC3        PC4         PC5         PC6
BIG_MAC   -0.8115166  0.5312307 -0.2618582  0.3511487 -0.23905513 -0.25916915
BOB       -0.0490075  1.0561610 -0.6368674 -0.1735671 -0.14564473 -0.35001333
BROOKS    -0.3640837 -0.2718282 -0.5761865 -0.4152877 -0.04842491  0.12150877
Caballo   -1.0215414  1.0254975 -0.5482151 -0.1670740 -0.09641643 -0.08068248
Cantara   -0.3632970 -0.2740405 -0.5834303 -0.4173835 -0.04428983  0.13075484
COKER_227 -1.2789803  0.9881427 -0.1254722  0.1159675  0.27366714 -0.20148855
                 PC7           PC8        PC9       PC10       PC11        PC12
BIG_MAC    0.2363907 -0.0006181521  0.1296274 -0.4084578  0.0569676 -0.20705046
BOB        0.5891244 -0.0929030801  1.1146056 -0.4584778  0.1281358 -0.00508864
BROOKS     0.1101220 -0.0862704449 -0.2734634  0.1983333 -0.1086180  0.31441081
Caballo   -0.1622409 -0.0660100764  0.2196498  0.4090137  0.1415139  0.33121767
Cantara    0.1054150 -0.0834733323 -0.2731552  0.1939836 -0.1104019  0.31561090
COKER_227 -0.2493228  0.1131566988  0.3039956 -0.2764918 -0.1882310  0.10051105
                 PC13        PC14      PC15
BIG_MAC    0.26894315 -0.07602738 0.3262893
BOB       -0.03380547 -0.13040304 0.1573659
BROOKS     0.12182094  0.21661256 0.4839442
Caballo    0.34046723 -0.25478573 0.4246931
Cantara    0.12998679  0.21764663 0.4886590
COKER_227  0.01373906 -0.12966889 0.2493736

PCA scores position each individual (oat genotypes in this case) in the principal components ‘space’. These PCA scores are also called eigenvector coefficients. Each PC is also called an eigenvector and has an associated eigenvalue, which corresponds to the amount of multivariate variability explained by that vector.

pca_on_snps$eigenvalues %>% head
      eigenvalue variance.percent cumulative.variance.percent
Dim.1  1.3526787        28.220223                    28.22022
Dim.2  0.5804136        12.108862                    40.32908
Dim.3  0.4634914         9.669576                    49.99866
Dim.4  0.2254203         4.702825                    54.70149
Dim.5  0.1822079         3.801308                    58.50279
Dim.6  0.1483505         3.094959                    61.59775

We can use their pre-made plot:

pca_on_snps$plot.scree

A “scree plot” displays the percent variance explained by each principal component (eigenvector) in order. As you can see, by definition the first one always explains the most.

pca_on_snps$plot.pca

scores <- pca_on_snps$pca.scores

scores <- scores %>%
  as.data.frame %>% 
  mutate(GID = rownames(scores))

ggplot(scores, aes(x = PC3, y = PC4)) +
  geom_point(size = 2, alpha = 0.8) +
  labs(title = "PCA of SNP matrix", x = "PC3 (9.7%)", y = "PC4 (4.7%)") +
  theme_bw()

On interpretation: Each data point is an individual oat genotype. The distance between two data points in the plot reflects the genetic similarity between the two samples. Without additional information, this is not immediately translatable into pedigree terms, meaning you can’t tell whether two samples are siblings, parent-offspring, or at any other relatedness level. Often, structures like families will appear as clusters of points.

snp.pca has options to color code and to draw ellipses around groups of related individuals.

Hint for the Hands-on: The White Lupin dataset comes with a metadata file that could be used to group those samples.

Hint for visualizing the oats: Sample names often start with a state-of-origin prefix (e.g. “FL0047-J9” is a line from Florida). I wonder if we can identify clusters-by-origin? You’ll have to use some text processing (grep, grepl) to extract state codes from the names.

rownames(oats_dosage_cleaned)[1:25]
 [1] "BIG_MAC"             "BOB"                 "BROOKS"             
 [4] "Caballo"             "Cantara"             "COKER_227"          
 [7] "COKER_234"           "Coker_716"           "Coronado"           
[10] "FL0047-J9"           "FL0115-J2"           "FL03001BSB-S7"      
[13] "FL03146BSB-S1-B-S1"  "FL03254-L1"          "FL04155-S06-31-B-S1"
[16] "FL0720-R6"           "FL0772-R3"           "FL0914-U2"          
[19] "FL0923-U2"           "FL0923-U4"           "FL0941-U1"          
[22] "FL0943-U2"           "FL0943-U4"           "FL1013-1"           
[25] "FL1013-2"           
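One way to attack that with base-R text processing (a sketch; the two-letters-then-a-digit pattern is my assumption about the naming convention, so check it against the real names):

```r
# Extract a leading two-letter state prefix when it's followed by a digit
nms <- c("FL0047-J9", "TX20CAS1079", "BIG_MAC", "LA17089SBSS-73-1")
prefix <- ifelse(grepl("^[A-Z]{2}[0-9]", nms),
                 substr(nms, 1, 2),
                 "other")
prefix
```

Applied to rownames(oats_dosage_cleaned), this gives a grouping vector you can join to the PCA scores for color-coding.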

Marker distribution

One last thing, to get us primed for mapping quantitative trait loci (QTL) along the genome: let’s make a plot of the marker distribution across the physical positions of the genome. Are they evenly distributed, or are there significant gaps? If designed properly, a genotyping strategy should evenly sample the whole genome.

snp_map<-tibble(SNP_ID=colnames(oats_dosage_cleaned)) %>% 
  separate(SNP_ID,c("ChrPos","SNP_ID"),":") %>% 
  separate(ChrPos,c("Chr","Pos"),"_") %>% 
  mutate(Pos=as.numeric(Pos))

snp_map %>% 
  ggplot(., aes(x = Pos, y = Chr)) +
  geom_point(size = 0.6, alpha = 0.7) +
  labs(title = "Physical distribution of SNP markers",
       x = "Genomic position (bp)",
       y = "Chromosome") +
  theme_bw()
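To put numbers on those gaps, you can compute the largest distance between adjacent markers per chromosome. A sketch on a tiny invented map (swap in snp_map for real use):

```r
library(dplyr)
# Toy marker map: chromosome and physical position (bp); values invented
toy_map <- tibble::tibble(
  Chr = c("chr1A", "chr1A", "chr1A", "chr2A", "chr2A"),
  Pos = c(1e5, 9e5, 1.2e6, 2e5, 5.2e6)
)
gaps <- toy_map %>%
  group_by(Chr) %>%
  arrange(Pos, .by_group = TRUE) %>%
  summarise(n_snps = n(), max_gap = max(diff(Pos)))
gaps
```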

Hands-on

Now it’s your turn!

  1. Explore the available SNP datasets. Use the examples above, but also check out the manual and see what you can do with what you’ve got. Try to go beyond the PCA we demonstrated above: identify sub-groups in the data (hints given above), color-code them, and see whether you can distinguish the groups in genetic space using the PCA results.
  2. Work on learning to interact with the shell (command line). If you need to, install WSL or find another way to get up and running with some of the software mentioned above (e.g. plink).

References

Browning, Brian L., Ying Zhou, and Sharon R. Browning. 2018. “A One-Penny Imputed Genome from Next-Generation Reference Panels.” The American Journal of Human Genetics 103 (3): 338–48. https://doi.org/10.1016/j.ajhg.2018.07.015.
Isik, Fikret, James Holland, and Christian Maltecca. 2017. Genetic Data Analysis for Plant and Animal Breeding. Springer International Publishing. https://doi.org/10.1007/978-3-319-55177-7.
Price, Alkes L, Nick J Patterson, Robert M Plenge, Michael E Weinblatt, Nancy A Shadick, and David Reich. 2006. “Principal Components Analysis Corrects for Stratification in Genome-Wide Association Studies.” Nature Genetics 38 (8): 904–9. https://doi.org/10.1038/ng1847.
Reich, David, Alkes L Price, and Nick Patterson. 2008. “Principal Component Analysis of Genetic Data.” Nature Genetics 40 (5): 491–92. https://doi.org/10.1038/ng0508-491.